Conversation

@andrwng (Contributor) commented Jan 6, 2026

Builds on top of the new LSM STM and introduces a new replicated_database abstraction that is intended to be opened only on leaders. It is an lsm::database whose data and metadata storage are backed by object storage for recoverability, and whose manifest is replicated through Raft.

After opening the database from the serialized manifest in the STM, leaders are expected to apply the remaining write batches from the volatile buffer before serving subsequent requests. This expectation is encoded in the replicated_database::open() call.
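For concreteness, here is a minimal sketch of the contract that open() encodes, written as a Seastar coroutine. The names raft_group, serve_requests, and the exact open() signature are illustrative assumptions; only the ordering (open, replay, then serve) comes from the description above.

#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

ss::future<> on_became_leader(raft_group& group) {
    // open() reads the serialized manifest from the STM, then replays
    // the write batches from the volatile buffer that were replicated
    // through Raft but not yet folded into a persisted manifest.
    auto db = co_await replicated_database::open(group);
    // Only once open() resolves is the database caught up to the tip
    // of the committed log and safe to serve requests in this term.
    co_await serve_requests(db);
}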

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • None

@andrwng andrwng force-pushed the l1-replicated-db branch 3 times, most recently from 2a46d7c to 37e3199 Compare January 8, 2026 18:37
@andrwng andrwng marked this pull request as ready for review January 8, 2026 19:00
Copilot AI review requested due to automatic review settings January 8, 2026 19:00
Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces a new replicated_database abstraction that wraps an LSM database with Raft-based replication. The database is leader-only and uses object storage for data persistence while replicating its manifest through Raft for fault tolerance.

Key changes:

  • Added timeout support to lsm::database::flush() to prevent indefinite blocking (see the caller-side sketch after this list)
  • Introduced memory_persistence_controller for testing failure scenarios
  • Implemented replicated_database class that coordinates LSM operations with Raft replication
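A hedged caller-side sketch of the timeout-enabled flush named in the first bullet; the timeout parameter type and the boolean return convention are assumptions rather than the PR's actual signature:

#include <chrono>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

ss::future<> checkpoint(lsm::database& db) {
    // Assumed overload: flush(std::optional<std::chrono::milliseconds>).
    // Bounds the flush instead of blocking forever on a failing
    // metadata persistence layer.
    bool flushed = co_await db.flush(std::chrono::seconds(10));
    if (!flushed) {
        // Timed out: the persistence layer kept failing. The caller
        // can back off and retry rather than hang indefinitely.
    }
}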

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.

Summary per file:

src/v/lsm/lsm.h: Added optional timeout parameter to the flush method signature
src/v/lsm/lsm.cc: Implemented timeout parameter forwarding in the flush wrapper
src/v/lsm/io/memory_persistence.h: Added controller struct for injecting failures in tests
src/v/lsm/io/memory_persistence.cc: Implemented failure injection logic in memory persistence
src/v/lsm/db/tests/impl_test.cc: Added test for flush timeout behavior
src/v/lsm/db/impl.h: Updated flush signature with timeout parameter
src/v/lsm/db/impl.cc: Implemented timeout enforcement in the flush operation
src/v/cloud_topics/level_one/metastore/lsm/tests/replicated_db_test.cc: Comprehensive test suite for replicated database functionality
src/v/cloud_topics/level_one/metastore/lsm/tests/BUILD: Build configuration for the new test
src/v/cloud_topics/level_one/metastore/lsm/replicated_persistence.h: Interface for Raft-replicated metadata persistence
src/v/cloud_topics/level_one/metastore/lsm/replicated_persistence.cc: Implementation of replicated metadata persistence
src/v/cloud_topics/level_one/metastore/lsm/replicated_db.h: Header for the replicated database abstraction
src/v/cloud_topics/level_one/metastore/lsm/replicated_db.cc: Core implementation of replicated database operations
src/v/cloud_topics/level_one/metastore/lsm/BUILD: Build configuration for the new libraries

@andrwng andrwng force-pushed the l1-replicated-db branch 2 times, most recently from c6676df to a69fd10 Compare January 8, 2026 22:25
@andrwng andrwng requested review from Lazin, dotnwat and rockwotj January 9, 2026 00:08
@rockwotj (Contributor) left a comment

Nice, a quick glance mostly LGTM. Will look more next week when back at a computer.

read_manifest(lsm::internal::database_epoch max_epoch) override {
_as.check();
auto _ = _gate.hold();
auto term_result = co_await _stm->sync(std::chrono::seconds(30));
Contributor commented:
do we need to make all this abortable too? Maybe the io layer needs an abort source in the apis... Anyways not for this PR

Contributor commented:
Is it supposed to be invoked right after the leadership transfer?

andrwng (Author) replied:
> do we need to make all this abortable too?

Done
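A minimal sketch of the pattern that "abortable" implies here, assuming an ss::abort_source& gets threaded through the io-layer APIs; the function name is hypothetical:

#include <chrono>
#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/sleep.hh>

ss::future<> read_with_abort(ss::abort_source& as) {
    // Fail fast if an abort was already requested; throws
    // ss::abort_requested_exception.
    as.check();
    // Long waits (syncs, retry backoffs) should also observe the abort
    // source so shutdown is not blocked behind them.
    co_await ss::sleep_abortable(std::chrono::milliseconds(100), as);
}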

andrwng (Author) replied:
> Is it supposed to be invoked right after the leadership transfer?

Yeah, the expectation is that this is called as part of opening the database, before performing any updates on the database in a given term.


cloud_io::remote* remote,
const cloud_storage_clients::bucket_name& bucket,
ss::abort_source& as) {
auto term_result = co_await s->sync(std::chrono::seconds(30));
Contributor commented:
same question, is it expected to be invoked right after the leadership transfer or the start?

andrwng (Author) replied:
This is expected to be called upon becoming leader before replicating any LSM updates in the given term (hence all LSM updates go through an already opened replicated_database instance)

// Replay the writes in the volatile_buffer as writes to the database.
// These are writes that were replicated but not yet persisted to the
// manifest.
auto max_persisted_seqno = db.max_persisted_seqno();
Contributor commented:
Do I understand correctly that this replay is not the same as the STM log replay? Here we're applying batches which are already stored by the STM (in other words they're applied to the STM but not to the LSM).

andrwng (Author) replied:
That's correct: STM log replay gets us the Raft-replicated entries of the volatile buffer that have not yet been persisted in the LSM manifest. The replay here applies those write batches on top, so the opened database is caught up to the tip of the committed log.
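A sketch of that catch-up step, under assumed shapes for the STM state and the apply() call; the seqno filter mirrors the snippet quoted above:

#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

ss::future<> replay_volatile_buffer(lsm::database& db, const stm_state& state) {
    // Batches at or below this seqno are already reflected in the
    // persisted manifest and must not be applied twice.
    auto max_persisted_seqno = db.max_persisted_seqno();
    for (const auto& entry : state.volatile_buffer) {
        if (entry.seqno <= max_persisted_seqno) {
            continue;
        }
        // Re-apply the Raft-replicated batch so the opened database
        // reaches the tip of the committed log for this term.
        co_await db.apply(entry.batch);
    }
}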

@andrwng andrwng force-pushed the l1-replicated-db branch 2 times, most recently from f14503f to cfd5991 Compare January 12, 2026 19:27
@vbotbuildovich (Collaborator) commented:

CI test results

test results on build#78915
EndToEndCloudTopicsTest.test_write (integration): FLAKY, passed 10/11. Test passes after retries; no significant increase in flaky rate (baseline=0.0000, p0=1.0000, reject_threshold=0.0100; adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000).
Job: https://buildkite.com/redpanda/redpanda/builds/78915#019bb3b7-0f88-4c50-b4a4-a8359b19aa0a
Test history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=EndToEndCloudTopicsTest&test_method=test_write

@andrwng andrwng requested review from Lazin, Copilot and rockwotj January 12, 2026 22:03
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.

Comment on lines +415 to +416
.row
= write_batch_row{.key = "key_before_reset", .value = iobuf::from("value_before_reset"),},
Copilot AI commented Jan 12, 2026:

The line formatting breaks the designated initializer on a single line by placing the closing brace and comma separately. This should be reformatted to either fit on one line or break consistently across multiple lines for better readability.

Suggested change:

- .row
-     = write_batch_row{.key = "key_before_reset", .value = iobuf::from("value_before_reset"),},
+ .row = write_batch_row{
+   .key = "key_before_reset",
+   .value = iobuf::from("value_before_reset"),
+ },

volatile_row{
.seqno = lsm::sequence_number{100},
.row
= write_batch_row{.key = "reset_key", .value = iobuf::from("reset_value"),},
Copilot AI commented Jan 12, 2026:

The line formatting breaks the designated initializer on a single line by placing the closing brace and comma separately. This should be reformatted to either fit on one line or break consistently across multiple lines for better readability.

Plumbs a new struct into memory persistence to allow tests to fail
operations. In the future this can be used to inject delays, randomized
failures, etc.
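A plausible shape for such a controller, with field names that are assumptions rather than the commit's actual ones:

#include <optional>
#include <system_error>

// Shared between a test and the memory persistence implementation; the
// test mutates it to steer the outcome of upcoming operations.
struct memory_persistence_controller {
    // If set, the next persistence operation fails with this error
    // instead of succeeding.
    std::optional<std::error_code> fail_next;
    // Room for the future extensions the commit mentions: injected
    // delays, randomized failure probabilities, etc.
};

The persistence implementation would then consult the controller at the top of each operation and return the injected error when one is pending.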
@andrwng (Author) commented Jan 13, 2026

Force push to rebase on dev

@rockwotj previously approved these changes Jan 13, 2026
In case of errors in the metadata persistence layer, flush would
previously hang until success. This adds an optional timeout for this
case, which will be useful for an upcoming metadata persistence layer
that uses Raft.
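One way to bound that wait in Seastar, assuming flush internally awaits a retry-until-success future; whether the actual implementation uses ss::with_timeout is an assumption, and do_flush() is a hypothetical stand-in:

#include <chrono>
#include <seastar/core/coroutine.hh>
#include <seastar/core/lowres_clock.hh>
#include <seastar/core/with_timeout.hh>

ss::future<bool> flush_with_timeout(
  std::optional<std::chrono::milliseconds> timeout) {
    auto fut = do_flush();
    if (!timeout) {
        // Preserve the old behavior: wait until persistence succeeds.
        co_await std::move(fut);
        co_return true;
    }
    try {
        co_await ss::with_timeout(
          ss::lowres_clock::now() + *timeout, std::move(fut));
        co_return true;
    } catch (const ss::timed_out_error&) {
        co_return false; // persistence kept failing past the deadline
    }
}

Note that ss::with_timeout abandons rather than cancels the inner future, so the underlying retry loop still needs its own abort or shutdown path.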
Introduces a wrapper around cloud_persistence that replicates and serves
the database manifest from Raft (while maintaining it in object storage
as well). A subsequent commit will introduce usage of this to maintain
a database across replicas of a Raft group.

Introduces a class that wraps lsm::database with the appropriate object
storage classes to be consistent across replica leaders (i.e. different
instances see a consistent view of the database upon leadership
changes).
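A sketch of the dual write path those commits describe, written as if a member of the replicated persistence wrapper; replicate_manifest and the _cloud->write_manifest call are assumed names:

#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

ss::future<std::error_code> write_manifest(iobuf serialized) {
    // 1. Replicate the manifest through the Raft STM so every replica
    //    can reconstruct a consistent view on leadership changes.
    auto ec = co_await _stm->replicate_manifest(serialized.copy());
    if (ec) {
        co_return ec;
    }
    // 2. Also persist to object storage, for recoverability beyond
    //    what the Raft log retains.
    co_return co_await _cloud->write_manifest(std::move(serialized));
}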

ss::future<std::optional<iobuf>>
read_manifest(lsm::internal::database_epoch max_epoch) override {
_as.check();
Contributor commented:
BTW, I think this is kind of useless because we never call close() until after all callers of this method have returned. Doesn't need to block this PR; we can fix it in a follow-up (I want to integrate @nvartolomei's context thing into the LSM).

// There is no persisted manifest.
co_return std::nullopt;
}
co_return _stm->state().persisted_manifest->buf.copy();
Contributor commented:
probably could share this out too

andrwng (Author) replied:
Will put out a follow-up (and fix some shares missed in the other PR).
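For reference, the kind of change being deferred: Redpanda's iobuf::share builds new fragment headers over the same underlying buffers, avoiding the allocation and memcpy that copy() performs. Whether sharing is safe here depends on how long the STM keeps that buffer stable, which is presumably part of the follow-up:

// Instead of: co_return _stm->state().persisted_manifest->buf.copy();
auto& buf = _stm->state().persisted_manifest->buf;
co_return buf.share(0, buf.size_bytes());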

@andrwng andrwng merged commit 7afd1bb into redpanda-data:dev Jan 14, 2026
19 checks passed
@andrwng mentioned this pull request Jan 14, 2026